Abstract. Abelian string matching problems have attracted considerable
interest in recent years. Very recently, Alatabbi et al. [1] presented the
first solution for the longest common Abelian factor problem for a pair of
strings, reaching $O(\sigma n^2)$ time with $O(\sigma n \log n)$ bits of
space, where $n$ is the length of the strings and $\sigma$ is the alphabet
size. In this note we show how the time complexity can be preserved while
the space is reduced by a factor of $\sigma$, and then how the time
complexity can be improved, if the alphabet is not too small, when
superlinear space is allowed.


1 Introduction 

The longest common Abelian factor (LCAF) problem, posed at the StringMasters
2013 meeting by Thierry Lecroq and Arnaud Lefebvre, can be stated as
follows: Given two strings $A$ and $B$, both of length $n$, over the alphabet
$\Sigma$, compute the maximal length of a factor in $A$ such that there
exists a factor in $B$ being its permutation (i.e., being an Abelian match).
Moreover, it is desirable to return some (or all) occurrences of such
factors in $A$ and $B$.
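As a reference point, the problem statement can be turned directly into a brute-force procedure. The Python sketch below is only meant to pin down the expected output, not to be efficient; the function name `lcaf_bruteforce`, the 0-based positions, and the tuple it returns are choices made here for illustration.

```python
from collections import Counter


def lcaf_bruteforce(a, b):
    """Return (length, i, j) such that a[i:i+length] and b[j:j+length]
    are a longest pair of Abelian-equal factors (0-based positions).

    Returns (0, 0, 0) if the two strings share no symbol.
    """
    for length in range(min(len(a), len(b)), 0, -1):
        # Map the Parikh vector of every window of `a` to its first position.
        # A frozenset of (symbol, count) pairs is a canonical hashable key.
        seen = {}
        for i in range(len(a) - length + 1):
            key = frozenset(Counter(a[i:i + length]).items())
            seen.setdefault(key, i)
        # Look for a window of `b` with the same Parikh vector.
        for j in range(len(b) - length + 1):
            key = frozenset(Counter(b[j:j + length]).items())
            if key in seen:
                return length, seen[key], j
    return 0, 0, 0
```

For instance, for the strings "abcab" and "bacxx" the longest Abelian match has length 3 ("abc" against "bac").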

To our knowledge, the only work on this problem was presented very recently
by Alatabbi et al. [1], in which they obtained $O(\sigma n^2)$ worst-case
time with $O(\sigma n \log n)$ bits of space, where $n$ is the length of the
strings and $\sigma$ is the alphabet size. Further on, we will express the
space in words, and the cited space becomes $O(\sigma n)$ words.

While the Alatabbi et al. algorithm is simple, let us note that the same
result can be immediately obtained by a reduction from a well-known problem,
the (standard) longest common factor (LCF)¹. Hui [3] showed that using a
generalized suffix tree it is possible to find the LCF for a pair of strings
of length $n$ in $O(n)$ time. We use this algorithm $n$ times, once for each
factor length $\ell$, replacing each factor of length $\ell$ by its Parikh
vector followed by a unique terminator (e.g., for the factors taken from $A$
the subsequent terminators can be $-1, -2, \ldots$, while for the factors
taken from $B$ they can be $-n-1, -n-2, \ldots$). The terminators prevent
matches longer than $\sigma$. If the found LCF is of length exactly
$\sigma$, it must correspond to a pair of factors, one from $A$ and one from
$B$, of length $\ell$. This is obtained in $O(\sigma n)$ time for one value
of $\ell$, using $O(\sigma n)$ space, hence the total time, for all possible
factor lengths, becomes $O(\sigma n^2)$ with $O(\sigma n)$ space (we build
and discard the generalized suffix trees one by one). In this way, we obtain
the same time and space as Alatabbi et al. did.

¹ Also known as the longest common substring (LCS) problem. We prefer the
word "factor" in the problem name, to avoid confusion with the abbreviation
for the longest common subsequence.
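The reduction for a single factor length can be illustrated with a short Python sketch. It assumes that strings are lists of integers from 1 to sigma, and it replaces the suffix-tree LCF algorithm of [3] with a naive quadratic dynamic program, so it reproduces the logic of the reduction but not its running time. All function names are hypothetical.

```python
def parikh(window, sigma):
    """Count the occurrences of each symbol 1..sigma in `window`."""
    counts = [0] * sigma
    for c in window:
        counts[c - 1] += 1
    return counts


def encode(s, length, sigma, first_terminator):
    """Concatenate the Parikh vectors of all windows of `s`, each followed
    by a unique negative terminator (first_terminator, first_terminator-1, ...).
    Recomputing every vector from scratch is a simplification of this sketch."""
    out = []
    t = first_terminator
    for i in range(len(s) - length + 1):
        out.extend(parikh(s[i:i + length], sigma))
        out.append(t)
        t -= 1
    return out


def lcf_length(x, y):
    """Length of the longest common factor of x and y (naive DP stand-in
    for the linear-time suffix-tree method)."""
    best = 0
    prev = [0] * (len(y) + 1)
    for xi in x:
        cur = [0] * (len(y) + 1)
        for j, yj in enumerate(y, 1):
            if xi == yj:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def has_abelian_match(a, b, length, sigma):
    """True iff some window of `a` and some window of `b`, both of the given
    length, have equal Parikh vectors."""
    n = len(a)
    x = encode(a, length, sigma, -1)        # terminators -1, -2, ...
    y = encode(b, length, sigma, -n - 1)    # terminators -n-1, -n-2, ...
    # Counts are non-negative and terminators are unique, so a common factor
    # cannot contain a terminator and thus has length at most sigma.
    return lcf_length(x, y) == sigma
```

Calling `has_abelian_match` for every length from n down to 1 and stopping at the first success yields the LCAF length.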


2 Preliminaries 

Let $S$ be a string of length $n$ over an alphabet $\Sigma$ of size
$\sigma = |\Sigma|$. It can also be written as $S[1 \ldots n]$, where
$S[i]$, $1 \le i \le n$, denotes its $i$-th symbol. An analogous notation
will be used for arrays.

Throughout the note we assume that $\sigma = O(n)$ and
$\Sigma = \{1, 2, \ldots, \sigma\}$. (If this is not the case, we can remap
the alphabet for both input strings at the start with standard means, in
$O(n \log n)$ time and $O(n)$ extra space.)

The Parikh vector for string $S$, denoted as $P(S)[1 \ldots \sigma]$, is
defined as a vector (array) of size $\sigma$ storing the number of
occurrences of each alphabet symbol in $S$. Formally, $P(S)[c] = k$ iff
$|\{j : S[j] = c\}| = k$, for any alphabet symbol $c$. For two strings $S$
and $T$ of equal length and over a common alphabet, we say that the Parikh
vector $P(S)$ is (lexicographically) smaller than the Parikh vector $P(T)$,
denoted as $P(S) < P(T)$, iff there exists an alphabet symbol $c'$,
$1 \le c' \le \sigma$, such that $P(S)[c] = P(T)[c]$ for all $c < c'$ and
$P(S)[c'] < P(T)[c']$. The two Parikh vectors are equal, i.e.,
$P(S) = P(T)$, when $P(S)[c] = P(T)[c]$ for all symbols $c$.
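These definitions translate almost literally into code. The Python sketch below assumes symbols are the integers 1 to sigma and uses the standard lexicographic order on count arrays; the direction of the order is immaterial to the algorithms in this note as long as it is applied consistently. The function names are illustrative only.

```python
def parikh_vector(s, sigma):
    """Return P(s) as a list: entry c-1 holds the number of occurrences of symbol c."""
    p = [0] * sigma
    for c in s:
        p[c - 1] += 1
    return p


def abelian_equal(s, t, sigma):
    """True iff s and t are permutations of each other (equal Parikh vectors)."""
    return parikh_vector(s, sigma) == parikh_vector(t, sigma)


def parikh_less(s, t, sigma):
    """True iff P(s) precedes P(t) in the standard lexicographic order,
    decided at the first symbol whose counts differ."""
    ps, pt = parikh_vector(s, sigma), parikh_vector(t, sigma)
    for c in range(sigma):
        if ps[c] != pt[c]:
            return ps[c] < pt[c]
    return False  # the vectors are equal
```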

3 Reducing the space 

First, let us note that recently Kociumaka et al. [4] showed that for any
tradeoff parameter $1 \le \tau \le n$, the LCF problem can be solved in
$O(\tau)$ space and $O(n^2/\tau)$ time. Applying this to the LCAF problem,
we obtain $O(\tau \sigma n^2)$ time using $O(\sigma n/\tau)$ space, for any
$1 \le \tau \le \sigma n$.
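The bound for LCAF follows by substitution. In the sketch below, $N$ denotes the length of the strings produced by the reduction from Section 1 for one factor length, and $\tau'$ is a symbol introduced here for the parameter actually passed to the LCF algorithm.

```latex
\begin{align*}
N &= O(\sigma n), \qquad \tau' = N/\tau = \sigma n/\tau, \\
\text{time for one factor length} &= O\!\left(N^2/\tau'\right) = O(N\tau)
   = O(\tau\sigma n), \\
\text{time over all } n \text{ factor lengths} &= O(\tau\sigma n^2), \qquad
\text{space} = O(\tau') = O(\sigma n/\tau).
\end{align*}
```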

Yet, the specifics of LCAF allow for a better result. We consider each
factor length $\ell$ separately. For a given $\ell$, we sort all
$n - \ell + 1$ factors of $A$ according to their Parikh vectors, using the
LSD radix sort. Each factor is represented as its start position in $A$.
There are $\sigma$ passes of the radix sort and accessing the keys' "digits"
seems to be the soft spot of this variant. Yet, before each pass of the
radix sort we scan $A$ and for each $\ell$-sized window collect the count of
the corresponding symbol in it. More precisely, just before the $i$-th pass
of the radix sort, in which the keys will be distributed according to
$P(\cdot)[\sigma - i + 1]$, we compute and store
$P(A[j \ldots j+\ell-1])[\sigma - i + 1]$ for each factor
$A[j \ldots j+\ell-1]$, using $O(n)$ time and $O(n)$ extra space. Thanks to
this, we can access a digit in the radix sort in constant time. After the
$i$-th pass, the $P(\cdot)[\sigma - i + 1]$ statistics are discarded. In
this way, sorting of the $\ell$-long factors of $A$ takes $O(\sigma n)$ time
and its output (and working area) requires $O(n)$ words of space.
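The sorting phase described above can be sketched in Python as follows. The sketch assumes 0-based positions, symbols 1 to sigma, and 1 <= length <= len(a); each pass is a stable counting sort, and the per-pass "digit" array is rebuilt with a sliding window and then dropped. The function name is a placeholder.

```python
def sort_factors_by_parikh(a, length, sigma):
    """Return the start positions of all windows a[j:j+length], sorted by
    Parikh vector in lexicographic order (ties keep increasing position)."""
    n = len(a)
    m = n - length + 1                      # number of windows
    order = list(range(m))
    for sym in range(sigma, 0, -1):         # least significant digit first
        # digit[j] = occurrences of `sym` in a[j:j+length], in O(n) total
        digit = [0] * m
        cnt = sum(1 for c in a[:length] if c == sym)
        digit[0] = cnt
        for j in range(1, m):
            cnt += (a[j + length - 1] == sym) - (a[j - 1] == sym)
            digit[j] = cnt
        # Stable counting sort of `order` by digit; digits lie in 0..length.
        start = [0] * (length + 2)
        for j in order:
            start[digit[j] + 1] += 1
        for v in range(1, length + 2):
            start[v] += start[v - 1]        # start[d] = first slot for digit d
        out = [0] * m
        for j in order:
            out[start[digit[j]]] = j
            start[digit[j]] += 1
        order = out                         # `digit` is discarded here
    return order
```

Only the arrays `order`, `out`, `digit` and `start` are alive at any time, which is where the $O(n)$ words of working space come from.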


We sort the factors of $B$ in the same way. Additionally, for every
$\sigma$-th evenly sampled $\ell$-long factor of $A$ and $B$, we store
explicitly its Parikh vector using $O(\sigma)$ space. More precisely, we
compute and store the Parikh vectors for the factors $A[1 \ldots \ell]$,
$A[\sigma+1 \ldots \sigma+\ell]$, $A[2\sigma+1 \ldots 2\sigma+\ell]$,
$\ldots$, and similarly for $B[1 \ldots \ell]$,
$B[\sigma+1 \ldots \sigma+\ell]$, $B[2\sigma+1 \ldots 2\sigma+\ell]$,
$\ldots$. As we scan the strings from left to right and compute the
successive Parikh vectors incrementally (first making a copy of the previous
vector), this phase takes $O(n + (n/\sigma)\sigma) = O(n)$ time and $O(n)$
space.

The computed Parikh vectors serve to speed up factor comparisons during
the last phase, which is to intersect the lists of factors from $A$ and $B$,
similarly as in a binary merge operation. Thanks to the Parikh vectors kept
in regular intervals of $A$ and $B$, each factor comparison takes
$O(\sigma)$ time, therefore the intersection takes $O(\sigma n)$ time.
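A possible realization of the sampling and of the merge-like intersection is sketched below in Python. It assumes 0-based positions, symbols 1 to sigma, and position lists already sorted by Parikh vector in standard lexicographic order; the Parikh vector of an arbitrary window is rebuilt from the nearest sample to its left with fewer than sigma sliding steps. The function names are hypothetical.

```python
def sample_parikh(s, length, sigma):
    """Parikh vectors of the windows starting at positions 0, sigma, 2*sigma, ..."""
    m = len(s) - length + 1
    cur = [0] * sigma
    for c in s[:length]:
        cur[c - 1] += 1
    samples = []
    for j in range(m):
        if j % sigma == 0:
            samples.append(cur[:])          # store a copy of the current vector
        if j + length < len(s):             # slide the window one step right
            cur[s[j] - 1] -= 1
            cur[s[j + length] - 1] += 1
    return samples


def parikh_at(s, length, sigma, samples, j):
    """Parikh vector of s[j:j+length], rebuilt from a sample in O(sigma) time."""
    base = (j // sigma) * sigma
    vec = samples[j // sigma][:]
    for t in range(base, j):                # fewer than sigma sliding steps
        vec[s[t] - 1] -= 1
        vec[s[t + length] - 1] += 1
    return vec


def intersect_sorted(a, b, order_a, order_b, length, sigma):
    """Scan two sorted position lists as in a binary merge; return one pair
    (i, j) with Abelian-equal windows a[i:i+length], b[j:j+length], or None."""
    sa = sample_parikh(a, length, sigma)
    sb = sample_parikh(b, length, sigma)
    i = j = 0
    while i < len(order_a) and j < len(order_b):
        pa = parikh_at(a, length, sigma, sa, order_a[i])
        pb = parikh_at(b, length, sigma, sb, order_b[j])
        if pa == pb:
            return order_a[i], order_b[j]
        if pa < pb:                         # list comparison is lexicographic
            i += 1
        else:
            j += 1
    return None
```

The loop advances one of the two indices at every unsuccessful comparison, so it performs at most about $2n$ comparisons of $O(\sigma)$ cost each.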

The total cost of the described procedure, over all relevant factor lengths,
becomes $O(\sigma n^2)$ and the required space is $O(n)$. This matches the
time complexity of the Alatabbi et al. solution, yet the space usage is
decreased by a factor of $\sigma$.

4 Reducing the time 

4.1 The general variant 

In this section we present a variant which achieves $o(\sigma n^2)$ time for
the price of superlinear space. The key idea is to sort together factors of
varying (yet close) lengths.

The whole sorting phase runs in $O(n/k)$ steps, $k < \sigma$, where in the
$i$-th step the factors of both $A$ and $B$ of all lengths from $ik+1$ to
$(i+1)k$ are considered (yet, each group of factors, defined by their
length, is sorted separately). The required space grows to $O(kn)$. To
improve the time complexity, it is crucial to perform one step in
$o(k\sigma n)$ time. To this end, we make use of a data-oblivious sorting
algorithm. An algorithm is called data-oblivious if its sequence of possible
memory accesses is independent of its input values. There exist such sorting
algorithms working in $O(n \log n)$ worst-case time (assuming that keys can
be accessed in constant time), see [2] and references therein.

In our scenario, we compare the Parikh vectors of two factors of length
$ik+1$ in $O(\sigma)$ time and also collect all the positions $c$,
$1 \le c \le \sigma$, at which the respective Parikh vectors have different
values. These positions are inserted in bulk into a balanced binary search
tree $\mathcal{T}$, in $O(\sigma)$ time. Let the two factors be
$A[u \ldots u+ik]$ and $A[v \ldots v+ik]$. The next comparison concerns the
factors of length $ik+2$: $A[u \ldots u+ik+1]$ and $A[v \ldots v+ik+1]$.
Their Parikh vectors can be obtained by updating only one counter in each of
the previous vectors, which can also affect $\mathcal{T}$, as up to two
elements should now be added to $\mathcal{T}$ and up to two elements should
be removed from $\mathcal{T}$. The operations on $\mathcal{T}$, including
finding its minimum (or finding out that $\mathcal{T}$ is empty), which
immediately serves to resolve the factor comparison, take
$O(\log |\mathcal{T}|) = O(\log \sigma)$ time. Similarly we handle the next
pairs of factors, up to length $(i+1)k$. Each time equal (in the Abelian
sense) factors are found and one of them is from $A$ and the other from $B$,
we record their starting positions (in $A$ or $B$) and length. In this way,
we cannot miss the longest Abelian matching factors. Note that in a
comparison-based sort, and in particular in a deterministic data-oblivious
sort, it is impossible not to compare equal items at some moment, if such
exist. To see this, imagine that we associate a real number with each item
according to the sorted order; that is, the smallest item will have the
smallest number and the largest item the largest number, and equal items
will have equal associated numbers. Now, if two items, $x$ and $y$, are
equal and no other item in the collection is equal to $x$, not comparing $x$
to $y$ in the sorting process would mean that $x$ and $y$ are
indistinguishable. Suppose that, after the sorting, $x$ stands (just) before
$y$, and imagine that $x$ is modified in such a way that its associated
value becomes greater by $\epsilon/2$, where $\epsilon$ is the minimum
absolute difference between the associated values for any non-equal items in
the collection. The hypothetical sort algorithm not comparing $x$ to $y$
would then produce the same output as before, which of course means that the
algorithm is incorrect.
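The incremental comparison of two growing factors can be illustrated with the Python sketch below. It is a simplification under stated assumptions: a plain Python set stands in for the balanced binary search tree, so taking the minimum costs time linear in the set size rather than $O(\log \sigma)$; symbols are 1 to sigma, positions are 0-based, and the sign convention follows the standard lexicographic order. The generator name and its parameters are hypothetical.

```python
def compare_extensions(x, u, y, v, start_len, end_len, sigma):
    """Compare x[u:u+L] with y[v:v+L] in the Abelian sense for
    L = start_len, ..., end_len. Yields (L, sign), where sign is 0 for equal
    Parikh vectors, -1 if the first vector is smaller, +1 if it is larger."""
    # diff[c] = (count of c in the first factor) - (count in the second)
    diff = [0] * (sigma + 1)
    for c in x[u:u + start_len]:
        diff[c] += 1
    for c in y[v:v + start_len]:
        diff[c] -= 1
    # Symbols at which the two vectors differ (stand-in for the search tree).
    differing = {c for c in range(1, sigma + 1) if diff[c] != 0}

    def bump(c, delta):
        """Change one counter and keep the set of differing symbols in sync."""
        diff[c] += delta
        if diff[c] == 0:
            differing.discard(c)
        else:
            differing.add(c)

    L = start_len
    while True:
        if not differing:
            yield L, 0
        else:
            c = min(differing)              # first symbol where the counts differ
            yield L, (-1 if diff[c] < 0 else 1)
        if L == end_len:
            break
        bump(x[u + L], +1)                  # extend the first factor by one symbol
        bump(y[v + L], -1)                  # extend the second factor by one symbol
        L += 1
```

After the initial $O(\sigma)$ setup, each extension touches two counters only, which is the source of the $k \log \sigma$ term in the cost of one step when a real balanced tree is used.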

One step of the presented sort algorithm takes
$O((\sigma + k \log \sigma) n \log n)$ time, which sums up to
$O((\sigma/k + \log \sigma) n^2 \log n)$ time over all steps, and the space
usage is $O(kn)$. Note that a space-time tradeoff is obtained with $k$
between 2 and $\sigma/\log \sigma$. For example, we can set
$k = \sqrt{\sigma}$, which gives $O(\sqrt{\sigma}\, n^2 \log n)$ time and
$O(\sqrt{\sigma}\, n)$ space. This time complexity is $o(\sigma n^2)$ when
$\sigma = \omega(\log^2 n)$.
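For completeness, the arithmetic behind these bounds can be written out as follows; this is a routine expansion of the statements above rather than an additional result.

```latex
\begin{align*}
O\!\left(\frac{n}{k}\right) \cdot O\big((\sigma + k\log\sigma)\, n \log n\big)
  &= O\!\left(\left(\frac{\sigma}{k} + \log\sigma\right) n^2 \log n\right), \\
k = \sqrt{\sigma}: \quad
  \left(\sqrt{\sigma} + \log\sigma\right) n^2 \log n
  &= O\!\left(\sqrt{\sigma}\, n^2 \log n\right), \\
\sqrt{\sigma}\, n^2 \log n = o(\sigma n^2)
  &\iff \log n = o(\sqrt{\sigma})
  \iff \sigma = \omega(\log^2 n).
\end{align*}
```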

4.2 Faster, sometimes 

In the algorithm above, the Parikh vectors of factors of length $ik+1$ were
compared in $O(\sigma)$ time. Let us try to reduce this time, trying to
obtain a better overall space-time tradeoff.

To this end, for each length $ik+1$ we compute and store the Parikh vectors
for factors of $A$ and $B$ sampled every $d$-th position, where $d < \sigma$
will be chosen later. Additionally, we compute the positions of the
differences between each of the $O(n^2/d^2)$ pairs of Parikh vectors,
storing them in a balanced binary search tree, as described in the previous
subsection. This requires overall $O(\sigma n^3/(d^2 k))$ extra time and
$O(\sigma n^2/d^2)$ extra space. However, the "main" time component gets
reduced to $O((d/k + \log \sigma) n^2 \log n)$. As we are interested in
improving the space-time tradeoff, we need to check if $d$ can be set to
such a value that the space complexity is not compromised, yet the time
complexity improves, at least for some $k$ and $\sigma$. Clearly, it
requires that $\sigma n^2/d^2 = O(kn)$, i.e.,
$d = \Omega(\sqrt{\sigma n/k})$. As only $d = o(\sigma)$ may improve the
time complexity, we need to have $n/k = o(\sigma)$ (and of course
$\sigma = \omega(1)$). An extra requirement is
$k = o(\sigma/\log \sigma)$. Finally, improving the time complexity means
that $(d/k + \log \sigma) n^2 \log n + \sigma n^3/(d^2 k) =
o((\sigma/k + \log \sigma) n^2 \log n)$, which does not introduce an extra
constraint since $\sigma n^3/(d^2 k) = O(n^2)$, given the aforementioned
lower bound on $d$.

We set $d = \Theta(\sqrt{\sigma n/k})$. This implies
$d = \omega(\sqrt{n \log \sigma})$ and thus also
$\sigma = \omega(\sqrt{n \log n})$, which eventually gives
$d = \omega(\sqrt{n \log n})$.
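The chain of constraints on $d$ can be spelled out step by step. The derivation below only restates the conditions of this subsection; the last line uses the assumption $\sigma = O(n)$ from Section 2, which gives $\log \sigma = \Theta(\log n)$ once $\sigma$ is at least $\sqrt{n}$.

```latex
\begin{align*}
\frac{\sigma n^2}{d^2} = O(kn)
  &\iff d = \Omega\!\left(\sqrt{\sigma n/k}\right), \\
d = \Omega\!\left(\sqrt{\sigma n/k}\right)
  &\implies \frac{\sigma n^3}{d^2 k}
   = O\!\left(\frac{\sigma n^3}{k} \cdot \frac{k}{\sigma n}\right) = O(n^2), \\
k = o(\sigma/\log\sigma)
  &\implies d = \Theta\!\left(\sqrt{\sigma n/k}\right)
   = \omega\!\left(\sqrt{n \log\sigma}\right), \\
d = o(\sigma)
  &\implies \sigma = \omega\!\left(\sqrt{n \log\sigma}\right)
   = \omega\!\left(\sqrt{n \log n}\right).
\end{align*}
```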

To sum up, if $\sigma = \omega(\sqrt{n \log n})$ and $k = \omega(n/\sigma)$
but also $k = o(\sigma/\log \sigma)$, by choosing
$d = \Theta(\sqrt{\sigma n/k})$ we preserve the $O(kn)$ space and improve
the time to $O((\sqrt{\sigma n/k^3} + \log \sigma) n^2 \log n)$. In most
cases the improvement is not large: for example, if $\sigma = n^{0.8}$ and
$k = n^{0.4}$, the time complexity is slashed by a factor of $n^{0.1}$. On
the other hand, if e.g. $\sigma = \Theta(n/\log n)$ and
$k = \Theta(n^{2/3}/\log n)$, then the time complexity becomes
$O((\log \sigma) n^2 \log n)$, an improvement by a factor of $n^{1/3}$ up to
a logarithmic factor (more precisely, $\Theta(n^{1/3}/\log n)$).

5 Conclusions 

Finding the longest common Abelian factor is a recently posed problem, with
a solution given in [1], achieving $O(\sigma n^2)$ worst-case time and
needing $O(\sigma n)$ words of space. A significant weakness of that result
is its space requirement, which may be unacceptable with a larger alphabet.
In this work we improve this result in two ways.

One algorithm keeps the time complexity of the previous result, while it
reduces its space to $O(n)$. This is obtained with very simple means (the
key component is the LSD radix sort). The other algorithm of ours increases
the space to $O(kn)$ and achieves the time complexity of
$O((\sigma/k + \log \sigma) n^2 \log n)$, where
$k \le \sigma/\log \sigma$ is a freely chosen parameter. When
$\sigma = \omega(\log n \log\log n)$ it is always possible to choose such
$k$ that this algorithm beats the result from [1] in both time and space
complexity. This variant is also simple conceptually, yet it makes use of a
deterministic data-oblivious sort algorithm of optimal complexity in the
comparison-based model. There are several such algorithms known, but none of
them is really simple. A more practical choice could be the textbook Shell
sort algorithm with the sequence of gaps of the form $2^p 3^q$ proposed by
Pratt in 1972 [5]. Applying this Shell sort variant would deteriorate our
time complexity by a factor of $\log n$. The latter of the two algorithms is
also improved slightly for convenient values of $\sigma$ and $k$.
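To make the practical alternative concrete, here is a minimal Python sketch of Shell sort with Pratt's gap sequence, written as a fixed sequence of compare-exchange operations. It relies on the known property of this gap sequence that, when gap $h$ is processed, the array is already sorted with respect to gaps $2h$ and $3h$, so a single pass of compare-exchanges per gap suffices. The function names and the `less` parameter are illustrative choices.

```python
def pratt_gaps(n):
    """All integers of the form 2**p * 3**q that are smaller than n,
    in decreasing order."""
    gaps = []
    p2 = 1
    while p2 < n:
        g = p2
        while g < n:
            gaps.append(g)
            g *= 3
        p2 *= 2
    return sorted(gaps, reverse=True)


def pratt_shellsort(items, less=lambda x, y: x < y):
    """Sort `items` in place. The pairs of positions that are compared
    depend only on len(items), never on the values (data-oblivious)."""
    n = len(items)
    for h in pratt_gaps(n):
        for i in range(h, n):
            # compare-exchange of positions i-h and i
            if less(items[i], items[i - h]):
                items[i - h], items[i] = items[i], items[i - h]
    return items
```

There are $\Theta(\log^2 n)$ gaps and each costs one linear pass, giving $\Theta(n \log^2 n)$ comparisons, which is the extra $\log n$ factor mentioned above. In the LCAF setting, `less` would be replaced by a comparison of Parikh vectors.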

We are convinced that better algorithms for the LCAF problem are possible.
One obvious line of attack is using word-level parallelism (in the word-RAM
model) for Parikh vector comparisons. The anticipated speed-up factor is
however only about $w/\log(n/\sigma)$, where $w$ is the machine word size. A
more interesting question is whether sharing computations for different
factor lengths could be exploited with a stronger effect than presented
here.